Study guide
PSY715 – Research design & statistics
This guide covers aspects of the class that are considered important and not self-evident, but it should not be the only document you review.
This document may be updated at any time. Please check the date at the top of the document to see if it has been updated since your last visit.
1 What (not) to review
For exams, you do not need to know or apply the following:
- Calculations of degrees of freedom, standard errors and confidence intervals
- Calculations of Cramer’s V, Cohen’s d, standard deviation and variance
- Rules of thumb for effect sizes (i.e., thresholds for small, medium, large effects)
- Execution of the analyses in SPSS or other software
You should be familiar with the concepts and interpretations of the above items, as well as decisions regarding their use in research, but you will not be required to perform calculations.
2 Research design
2.1 Types of research design
2.1.1 Quantitative vs. Qualitative
- Quantitative research design
Involves the collection and analysis of numerical data using statistical methods. Emphasizes objective measurements, standardized instruments, and statistical analyses to establish relationships and make generalizations.
- Qualitative research design
Focuses on understanding and interpreting subjective experiences, meanings, and social contexts. Utilizes methods such as interviews, observations, and textual analysis to generate rich and in-depth descriptions and explore complex phenomena.
2.1.2 Observational vs. Experimental
- Observational research design
Involves observing and documenting behavior or phenomena in their natural settings without any intervention or manipulation by the researcher. Focuses on describing and understanding relationships, patterns, and behaviors as they naturally occur.
- Experimental research design
Involves manipulating variables and randomly assigning participants to different conditions to establish cause-and-effect relationships. Allows researchers to control and manipulate independent variables while measuring the effects on dependent variables to make causal inferences.
2.1.3 Inductive vs. Deductive
- Inductive Research Design:
- Starts with observations and data.
- Generates theories or generalizations based on patterns or trends identified in the data.
- Moves from specific observations to broader conclusions or theories.
- Deductive Research Design:
- Starts with theories, hypotheses, or existing knowledge.
- Tests specific hypotheses or predictions derived from theories.
- Involves collecting data to confirm or refute the hypotheses.
- Moves from general theories or hypotheses to specific observations.
2.2 Roles of variables
- Independent Variable (IV):
- Manipulated or controlled by the researcher.
- Changes intentionally.
- Hypothesized cause or predictor.
- Dependent Variable (DV):
- Measured or observed outcome.
- Affected by the independent variable.
- Variable of interest.
- Extraneous Variable:
- Variable(s) that may influence the DV.
- Not intentionally manipulated by the researcher.
- If not controlled for, prevents the researcher from determining if variations of the DV are due to the IV or to the extraneous variable.
- Need to be identified and controlled for to ensure accurate interpretation of the relationship between the IV and DV.
2.3 Levels of measurement
- Nominal:
- Categorical data without any inherent order or numerical value.
- Examples: Gender, eye color, marital status.
- Ordinal:
- Categorical or ordered data with a relative ranking or order.
- Categories have a meaningful order, but the differences between them may not be equal.
- Examples: Likert scales, educational levels (e.g., elementary, middle, high school).
- Interval:
- Numerical data with equal intervals between values.
- No true zero point.
- Arithmetic operations like addition and subtraction can be performed.
- Examples: Temperature in Celsius or Fahrenheit.
- Ratio:
- Numerical data with equal intervals between values and a true zero point.
- All arithmetic operations can be performed.
- Examples: Height, response time, income.
2.4 Qualities of research designs
- Internal validity:
- Refers to the extent to which a study accurately measures the cause-and-effect relationship between variables.
- Involves controlling potential confounding factors and ensuring that changes in the dependent variable are due to the manipulation of the independent variable.
- External validity:
- Represents the generalizability of research findings to the broader population or real-world contexts.
- Considers factors such as sample representativeness, research settings, and the ecological validity of the study.
- Ethical soundness:
- Involves ensuring that research adheres to ethical principles and guidelines.
- Includes obtaining informed consent from participants, protecting their rights and privacy, and minimizing any potential harm or discomfort.
- Measurement reliability:
- Refers to the consistency and stability of measurement or data collection procedures.
- Measurement validity:
- Represents the extent to which a study measures what it intends to measure or assess.
2.5 Sampling biases
- Convenience sampling bias:
- Occurs when participants are selected based on their availability or convenience.
- Can lead to a non-representative sample that may not accurately reflect the target population.
- Self-selection bias:
- Arises when individuals voluntarily choose to participate in a study.
- Can introduce bias as those who self-select may have unique characteristics or motivations that differ from the general population.
- Sampling bias due to non-response:
- Occurs when selected participants decline or fail to respond to the study’s invitation.
- Can result in a biased sample if non-responders have different characteristics than responders.
- Attrition bias:
- Occurs when there is a differential loss of participants or data during the course of a study.
- Can introduce bias if attrition is related to the variables being studied and can affect the representativeness and validity of the findings.
- Sampling bias due to small sample size:
- Occurs when the sample size is too small to represent the target population adequately.
- Findings based on small samples may not generalize well and can be more susceptible to chance variations.
- Sampling bias due to non-random selection:
- Arises when participants are not randomly selected from the population of interest.
- Can lead to a sample that is not representative and limits the generalizability of the study’s findings.
3 Univariate statistics
3.1 Location / Central tendency
Measures of location inform us about the typical observation for a variable.
3.1.1 Mode
The mode is the observation with the largest frequency.
3.1.2 Median
The median is the observation that splits the ordered observations into 2 equal-sized groups. It is also the 50th percentile and the 2nd quartile.
3.1.3 Mean
The mean \(\bar{y}\) is the sum of observations over the sample size.
\[\bar{y} = \frac{\sum y}{N}\]
Its value in the population is by convention written \(\mu\).
3.2 Measures of dispersion
Measures of dispersion inform us about how scattered the observations are. They consequently also inform us about how accurate measures of location are.
3.2.1 Range
The range is the distance between the maximum and the minimum.
3.2.2 Standard deviation
The standard deviation \(s\) is the typical distance to the mean.
\[s = \sqrt{ \frac{\sum (y - \bar{y})^2}{N-1}}\]
Its value in the population is by convention written \(\sigma\).
3.2.3 Variance
The variance \(s^2\) is the square of the standard deviation.
\[s^2 = \frac{\sum (y - \bar{y})^2}{N-1}\]
Its value in the population is by convention written \(\sigma^2\).
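As an optional illustration (not required for exams), the sketch below computes these location and dispersion measures in Python with NumPy; the sample values are made up.

```python
# Location and dispersion measures for a small made-up sample.
import numpy as np

y = np.array([4, 5, 5, 6, 7, 8, 10], dtype=float)

values, counts = np.unique(y, return_counts=True)
mode = values[counts.argmax()]   # observation with the largest frequency
median = np.median(y)            # 50th percentile / 2nd quartile
mean = y.mean()                  # sum of observations over the sample size

y_range = y.max() - y.min()      # range
s = y.std(ddof=1)                # sample standard deviation (N - 1 denominator)
s2 = y.var(ddof=1)               # sample variance
print(mode, median, mean, y_range, s, s2)
```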
4 \(\chi^2\) tests
4.1 One sample \(\chi^2\) test
The one sample \(\chi^2\) test is used to test whether the distribution of a categorical variable in a sample is different from a hypothesized distribution.
4.1.1 Example situations in psychology
- We study the prevalence of a condition in a sample and want to test whether it is different from the prevalence in the general population.
- We study the success rate on a test in a sample and want to test whether it is different from the success rate in a reference population.
- We study the distribution of a categorical demographic variable in a sample and want to test whether it is different from the distribution in the general population to assess the representativeness of the sample.
4.1.2 Null hypothesis
The null hypothesis \(H_0\) is that the distribution of the categorical variable in the population is the same as the hypothesized distribution.
4.1.3 Alternate hypothesis
The alternate hypothesis \(H_1\) is that the distribution of the categorical variable in the population is different from the hypothesized distribution.
4.1.4 Test statistic
The test statistic \(\chi^2\) is the sum of the squared differences between the observed and expected frequencies:
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
where \(O\) is the observed frequency and \(E\) is the expected frequency.
In this procedure, the expected frequency is the hypothesized proportion multiplied by the sample size.
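For illustration only (running software is not exam material), here is a minimal sketch of a one-sample \(\chi^2\) test in Python with SciPy; the observed counts and hypothesized proportions are made up.

```python
# One-sample (goodness-of-fit) chi-square test with made-up counts.
import numpy as np
from scipy import stats

observed = np.array([30, 70])              # e.g., with / without the condition
hypothesized = np.array([0.20, 0.80])      # hypothesized distribution
expected = hypothesized * observed.sum()   # hypothesized proportion x sample size

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)                             # df = number of categories - 1
```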
4.1.5 \(p\)-value
The \(p\)-value is the probability of observing a \(\chi^2\) at least as extreme as the one observed (i.e., the probability of observing frequencies at least as different from the expected frequencies as those observed), assuming that the null hypothesis is true.
4.1.6 Typical reporting
We typically report the test statistic, the degrees of freedom, and the \(p\)-value.
We conducted a one-sample \(\chi^2\) test (goodness-of-fit test) to examine whether the distribution of [categorical variable] was different from the hypothesized distribution (explain what it is). The distribution of [categorical variable] was different from the hypothesized distribution, \(\chi^2(df) = \chi^2_\text{observed}, p = p_\text{observed}\).
4.2 Independent samples \(\chi^2\) test
The independent samples \(\chi^2\) test is used to test whether the distribution of a categorical variable is different between several groups.
4.2.1 Example situations in psychology
- We study the prevalence of a condition in several groups and want to test whether it is different between the groups.
- We study the success rate on a test in several groups and want to test whether it is different between the groups.
- We study the distribution of a categorical demographic variable in several groups and want to test whether it is different between the groups.
4.2.2 Null hypothesis
The null hypothesis \(H_0\) is that the distribution of the categorical variable is the same between the groups.
4.2.3 Alternate hypothesis
The alternate hypothesis \(H_1\) is that the distribution of the categorical variable is different between the groups.
4.2.4 Contingency table
The contingency table is a table that shows the observed frequencies of the categorical variable in each group. For example:
| Group | Category 1 | Category 2 | Total |
|---|---|---|---|
| A | 20 | 30 | 50 |
| B | 30 | 20 | 50 |
| C | 40 | 40 | 80 |
| Total | 90 | 90 | 180 |
In this example, the categorical variable has 2 categories and there are 3 groups.
The frequency of a category in a group is referred to as a conditional frequency. For example, the conditional frequency of category 1 in group A is 20.
The total frequency of a category is referred to as a marginal frequency. For example, the marginal frequency of category 1 is 90 and the marginal frequency of group A is 50.
4.2.5 Test statistic
The test statistic \(\chi^2\) is the sum of the squared differences between the observed conditional frequencies and the expected conditional frequencies.
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
where \(O\) is the observed frequency and \(E\) is the expected frequency.
The expected conditional frequency for a cell in the contingency table is the product of the marginal frequencies corresponding to the cell, divided by the sample size. For example here, the expected conditional frequency of category 1 in group A is \(\frac{90 \times 50}{180} = 25\).
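As an optional illustration (not required for exams), the sketch below runs the test on the example contingency table above using SciPy.

```python
# Independent-samples chi-square test on the example table (groups A, B, C).
import numpy as np
from scipy import stats

table = np.array([[20, 30],   # group A
                  [30, 20],   # group B
                  [40, 40]])  # group C

chi2, p, df, expected = stats.chi2_contingency(table)
print(expected[0, 0])   # expected frequency of category 1 in group A: 90 * 50 / 180 = 25
print(chi2, df, p)      # df = (number of groups - 1) * (number of categories - 1)
```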
4.2.6 \(p\)-value
The \(p\)-value is the probability of observing a \(\chi^2\) at least as extreme as the one observed (i.e., the probability of observing frequencies at least as different from the expected frequencies as those observed), assuming that the null hypothesis is true.
4.2.7 Effect size
Cramér’s \(V\) is a measure of effect size for \(\chi^2\) tests. It is the square root of \(\chi^2\) divided by the product of the sample size and \(k - 1\):
\[V = \sqrt{\frac{\chi^2}{N(k-1)}}\]
where \(N\) is the sample size and \(k\) is the smaller of the number of rows and columns of the contingency table (here, the number of categories).
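Continuing the sketch above (optional; the table values are made up), Cramér’s \(V\) can be computed directly from \(\chi^2\):

```python
# Cramér's V for the example contingency table.
import numpy as np
from scipy import stats

table = np.array([[20, 30], [30, 20], [40, 40]])
chi2, p, df, expected = stats.chi2_contingency(table)

N = table.sum()        # total sample size (180)
k = min(table.shape)   # smaller of the number of rows and columns (here 2)
V = np.sqrt(chi2 / (N * (k - 1)))
print(V)
```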
Rules of thumb for interpreting \(V\):
| Cramér’s \(V\) | Interpretation |
|---|---|
| 0.00 to 0.09 | Negligible/Null effect size |
| 0.10 to 0.29 | Small/weak effect size |
| 0.30 to 0.49 | Medium/Moderate effect size |
| 0.50 and above | Large/Strong effect size |
4.2.8 Typical reporting
We typically report the test statistic, the degrees of freedom, the \(p\)-value, and the effect size (Cramér’s \(V\)).
We conducted an independent samples \(\chi^2\) test to examine whether the distribution of [categorical variable] was different between the groups (explain what they are). The distribution of [categorical variable] was different between the groups, \(\chi^2(df) = \chi^2_\text{observed}, p = p_\text{observed}, V = V_\text{observed}\).
5 Linear models
5.1 General formulation and structural assumption
Linear Models (LM) are a class of models, in which an outcome variable \(Y\) is predicted as a linear function of \(p\) predicting variables \(X_1, X_2,...,X_p\).
For the \(i^\text{th}\) case, we have:
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + ... + \beta_px_{pi} + \epsilon_i\]
\(\beta_0\), \(\beta_1\), …, \(\beta_p\) are fixed real numbers to estimate, called model parameters.
\(\beta_0\) is referred to as the intercept.
\(\beta_1\), \(\beta_2\), …, \(\beta_p\) are referred to as the slope(s).
5.2 Distributional assumption
The errors, conditional upon predictor values \(x\), are assumed to be randomly drawn from a Gaussian (Normal) distribution of fixed variance \(\sigma^2\):
\[\epsilon_i|x_i \sim \text{Normal}(0,\sigma^2)\]
Alternatively, we can write that the distribution of \(y\), conditional upon predictor values \(x\) follows a Normal distribution of fixed variance \(\sigma^2\):
\[y|x_i \sim \text{Normal}(\mu_i,\sigma^2)\]
5.3 Important particular cases
5.3.1 Intercept-only models
The intercept-only model (or grand mean model, or mean model) is a model that predicts the outcome as a constant (the intercept \(\beta_0\)).
\[y_i = \beta_0 + \epsilon_i\]
The one-sample \(t\) test and the paired-samples \(t\) tests can be seen as intercept-only linear models.
5.3.2 Models with only one slope
Models with an intercept \(\beta_0\) and only one slope (and thus one predictor) \(\beta_1\) also have particularities.
\[y_i = \beta_0 + \beta_1x_{1i} + \epsilon_i\]
Simple regression / Pearson correlation \(r\) and the independent-samples \(t\) test are models with an intercept and a single slope.
5.4 Relation to the mean
The predicted/expected value for \(y\) – noted \(\hat{y}\) or \(E(y)\) – is equal to the sample mean of \(y\) (conditional on the predictors, if any). It is also the maximum likelihood estimator of the population mean (conditional on the predictors, if any).
Thus, a linear model is essentially a model that predicts the mean (of a Normal distribution), conditional upon the predicting variables \(X_1, X_2,...,X_p\).
5.5 Parameter estimation
In Linear Models, parameters are in general estimated through Ordinary Least Squares. It consists in finding parameter values so that the sum of squared errors (\(SSE\) or \(SSR\)) is minimized.
5.6 Testing parameters
In general, in linear models, parameters can be significance tested against some comparison value \(\beta_{H_0}\) (most of the time \(0\)), using a \(t\) test (often referred to as a Wald \(t\) test).
The null hypothesis is specified as:
\[H_0 : \beta = \beta_{H_0}\]
The (non-directional) hypothesis is thus:
\[H_1 : \beta \neq \beta_{H_0}\]
Under \(H_0\), the difference between the sample estimate \(\hat{\beta}\) and the comparison value \(\beta_{H_0}\) (frequently \(0\)), divided by the standard error of \(\beta\) follows a Student’s \(t\) distribution, with degrees of freedom \(df = N - p - 1\) (note: \(p\) is here the number of predictors in the model).
\[t = \frac{\hat{\beta} - \beta_{H_0}}{SE_{\beta} } \sim t(N-p-1)\]
The calculation of probabilities for values above and below \(t_\text{observed}\) yields the \(p\) value: the probability that \(t\) is greater than \(t_\text{observed}\) in absolute value.
A \(p\) value smaller than \(.05\) allows us to conclude that there is a significant difference between the parameter and the comparison value. A \(p\) value larger than \(.05\) does not allow any conclusion regarding that difference.
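For illustration only (not exam material), the sketch below computes a Wald \(t\) test by hand in Python; the estimate, standard error, sample size, and number of predictors are made-up numbers.

```python
# Wald t test of a parameter against a comparison value (made-up numbers).
from scipy import stats

beta_hat = 0.35   # sample estimate of the parameter
beta_h0 = 0.0     # comparison value under H0
se_beta = 0.12    # standard error of the estimate
N, p = 50, 2      # sample size and number of predictors in the model

t = (beta_hat - beta_h0) / se_beta
df = N - p - 1
p_value = 2 * stats.t.sf(abs(t), df)   # two-tailed p value
print(t, df, p_value)
```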
5.7 Sums of squares
Sums of squares are used to decompose the variability of the outcome variable (i.e., to do an “analysis of variance”).
The sum of squared errors \(SSE\) quantifies the distance between the predictions of the model \(\hat{y}\) and the observations \(y\): \(SSE = \Sigma(\hat{y} - y)^2\)
The total sum of squares \(SST\) quantifies the distance between the predictions of the simplest model possible (the intercept-only model, or mean model) \(\bar{y}\) and the observations \(y\): \(SST = \Sigma(\bar{y} - y)^2\)
The model sum of squares \(SSM\) quantifies the distance between the predictions of the model \(\hat{y}\) and the predictions of the simplest model possible (the intercept-only model, or mean model) \(\bar{y}\): \(SSM = \Sigma(\hat{y} - \bar{y})^2\)
The intercept-only model has an \(SSM=0\) and an \(SSE = SST\). Thus the \(SST\) is the \(SSE\) of an intercept-only model.
5.8 Coefficient of determination \(R^2\)
\(R^2\) is used to determine the fit of the model to the data. It is computed as the variability explained by the model \(SSM\) over the total variability to explain \(SST\):
\[R^2 = \frac{SSM}{SST}\]
\(R^2\) ranges from \(0\) (lowest fit possible) to \(1\) (perfect fit).
\(R^2\) is generally interpreted as the proportion of variance of \(Y\) explained by the model.
For the intercept-only model, \(R^2=0\). It is trivial and thus not reported.
5.9 \(F\) test of model fit
Model fit can be significance tested. We specify a null hypothesis stating that, in the population, the model does not predict the data any better than an intercept-only model:
\[H_0 : R^2 = 0\]
As a result the alternate hypothesis is:
\[H_1 : R^2 > 0\]
Under \(H_0\), the ratio of the mean squares of the model (\(SSM/df_M\)) over the mean squares of the errors (\(SSE/df_E\)), named \(F\), follows a Fisher’s \(F\) distribution, of degrees of freedom \(p\) and \(N-p-1\):
\[F = \frac{SSM/df_M}{SSE/df_E} = \frac{MSM}{MSE} \sim F(p, N-p-1)\]
The calculation of the probability of values above \(F_\text{observed}\) yields the \(p\) value: the probability that \(F\) is greater than \(F_\text{observed}\).
A \(p\) value smaller than \(.05\) allows us to conclude that there is a significant difference in fit between the model and the intercept-only model (we generally say that the model “significantly fits”, or that \(R^2\) is significant). A \(p\) value larger than \(.05\) does not allow any conclusion regarding model fit.
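As an optional illustration (not required for exams), the sketch below computes the sums of squares, \(R^2\), and the \(F\) test for a small made-up one-predictor model in Python.

```python
# Sums of squares, R², and F test of model fit (made-up data, one predictor).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 4.5, 5.0, 6.5, 7.0, 8.5])
N, p = len(y), 1

slope, intercept = np.polyfit(x, y, deg=1)   # OLS estimates
y_hat = intercept + slope * x                # model predictions

sse = np.sum((y_hat - y) ** 2)               # sum of squared errors
sst = np.sum((y.mean() - y) ** 2)            # SSE of the intercept-only (mean) model
ssm = np.sum((y_hat - y.mean()) ** 2)        # variability explained by the model

r2 = ssm / sst                               # coefficient of determination
F = (ssm / p) / (sse / (N - p - 1))          # MSM / MSE
p_value = stats.f.sf(F, p, N - p - 1)        # probability of values above F_observed
print(r2, F, p_value)
```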
6 Linear model assumptions
6.1 Conditional normality
6.1.1 What we study
To test conditional normality, depending on the model, we generally study one or more distributions.
If there are no predictors (i.e., intercept-only model), we directly study the distribution of \(Y\).
If the predictor \(X\) is a discrete variable, we can study the distribution of \(Y\) for each (observed) value of \(X\). The ensemble of these distributions of \(Y\) at each \(X\) is referred to as the conditional distribution of \(Y\).
If the predictor \(X\) is a continuous variable, we generally study the distribution of the residuals of the model.
6.1.2 How we study it
Distributions can be studied for normality, notably using:
- histograms, frequency plots, density plots
- normality tests (e.g., Kolmogorov-Smirnov test, Shapiro-Wilk test)
- Should be non-significant (a significant test indicates a significant departure from normality)
- measures of skewness and kurtosis
- Should be close to \(0\)
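For illustration only (not exam material), the sketch below applies these checks in Python with SciPy to a made-up sample; the same tools apply to model residuals.

```python
# Normality checks on a made-up sample (the same tools apply to residuals).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=10, scale=2, size=100)   # illustrative data

w, p = stats.shapiro(y)   # Shapiro-Wilk: a significant p indicates departure from normality
skew = stats.skew(y)      # should be close to 0
kurt = stats.kurtosis(y)  # excess kurtosis, should be close to 0
print(p, skew, kurt)
```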
6.2 Homogeneity of variance
6.2.1 What we study
To test homogeneity of variance (i.e., homoscedasticity), we use different procedures depending on the model. In general:
If there are no predictors (i.e., intercept-only model), homogeneity of variance is trivial (the variance cannot vary as a function of predictors if there isn’t any).
If the predictor \(X\) is a discrete variable, we can study the distribution of \(Y\) (or of the residuals) for each (observed) value of \(X\) (if the assumption is true, we should not have different variances of \(Y\) for different values of \(X\)).
If the predictor \(X\) is a continuous variable, we generally study how the residuals may vary as a function of the predictor (if the assumption is true, we should see no relation).
6.2.2 How we study it
Homogeneity of variance is notably investigated using the following tools:
- Levene’s test (compares variances across independent groups). A significant test indicates significantly different variances (i.e., violated assumption).
- A box-plot/density plot/histogram by group is often used with it.
- Auxiliary regression (a regression model where the residuals are predicted using the model predictors). A significant test implies a variance that significantly varies as a function of the predictors (i.e., violated assumption).
- A residuals by predicted plot is often used with it.
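As an optional illustration (not required for exams), the sketch below shows both approaches in Python with SciPy on made-up data; the auxiliary regression here uses absolute residuals, one common variant (others use squared residuals).

```python
# Homogeneity-of-variance checks on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Levene's test across two independent groups
group1 = rng.normal(10, 2, size=50)
group2 = rng.normal(12, 2, size=50)
lev_stat, lev_p = stats.levene(group1, group2)   # significant p => unequal variances

# Auxiliary regression: absolute residuals predicted from a continuous predictor
x = rng.normal(0, 1, size=100)
y = 2 + 0.5 * x + rng.normal(0, 1, size=100)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
aux = stats.linregress(x, np.abs(residuals))     # significant slope => heteroscedasticity
print(lev_p, aux.pvalue)
```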
7 Intercept-only linear models
The following models can be discussed as applications of intercept-only linear models. An intercept-only model is formulated as:
\[y_i = \beta_0 + \epsilon_i\]
Some important notes regarding intercept-only linear models:
- The population mean of \(Y\), often noted \(\mu\), is equal to the population intercept \(\beta_0\).
- The (Maximum Likelihood / Ordinary Least Squares) estimate of the intercept \(\hat{\beta}_0\) is equal to the sample mean \(\bar{y}\).
7.1 One sample \(t\) test
The one-sample \(t\) test is used for situations where we want to compare the mean of a variable \(Y\) with some theoretical value of interest \(\mu_0\) (some reference value, some general population value, a threshold, the central point in a scale, etc.).
7.1.1 Example situations in psychology
- We study the acceptability of or satisfaction with a treatment for a disease. We want to know if the mean acceptability of the treatment is above a threshold of \(5\) on a \(10\)-point scale.
- We study the mean level of anxiety in a population. We want to know if the mean level of anxiety is above the mean level of anxiety in the general population.
- We study the response time to a stimulus. We want to know if the mean response time is below \(500\)ms.
- We want to check the representativeness of a sample with respect to a numeric demographic variable like age. We want to know if the mean age of the sample is representative of the mean age in the population.
7.1.2 Mean differences
A common way to describe the difference between the mean \(\bar{y}\) (which is the sample mean and the estimator for the population mean) and the theoretical value \(\mu_0\) is to compute a (raw) mean difference \(\Delta \bar{y}\):
\[\Delta \bar{y} = \bar{y} - \mu_0\]
This difference is expressed in the original units of the variable \(Y\): A mean difference of \(.4\) means that the mean \(\bar{y}\) is \(.4\) units higher than the reference value \(\mu_0\).
But in a lot of cases in the social sciences, these units are arbitrary and/or meaningless. Therefore, it is frequent to express the mean difference in standard deviations. For this, we compute the Standardized Mean Difference (\(SMD\), also referred to as Cohen’s \(d\)):
\[SMD = d= \frac{\Delta \bar{y}}{\sigma}\]
If the population standard deviation \(\sigma\) is known (in practice, it is rarely the case), it is used in the formulation. More frequently, we replace it with its estimator, which is the sample standard deviation \(s\).
A standardized mean difference of \(.2\) means that the mean \(\bar{y}\) is \(.2\) standard deviations higher than the reference value \(\mu_0\).
The standardized mean difference is the most common measure of effect size in this context.
| Cohen’s d (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.19 | Negligible effect size |
| 0.20 to 0.39 | Small/weak effect size |
| 0.40 to 0.69 | Medium/Moderate effect size |
| 0.70 and above | Large/Strong effect size |
7.1.3 Null hypothesis
In a one-sample \(t\) test, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu = \mu_0\]
Alternatively, we can write:
\[H_0 : \mu - \mu_0 = 0\]
7.1.4 Alternate hypothesis
The non-directional hypothesis states that the mean differs from the reference value in the population:
\[H_1 : \mu \neq \mu_0\] which can be written as:
\[H_1 : \mu - \mu_0 \neq 0\]
A directional alternate hypothesis specifies a direction for that inequality:
\[H_1' : \mu > \mu_0 \text{ or } H_1' : \mu < \mu_0\]
If using a directional hypothesis, the direction must be specified prior to analysis.
7.1.5 The \(t\) statistic
Under \(H_0\), the mean difference is null in the population, which implies that the following \(t\) statistic…
\[t = \frac{\bar{y} - \mu_0}{ SE_\bar{y} }\]
…follows a Student’s \(t\) distribution in the sample, with degrees of freedom \(df = N - 1\).
\(SE_\bar{y}\) is the standard error of the mean.
7.1.6 \(p\) value
The observed \(t\) value is located in the \(t(df)\) distribution. The probability that \(t\) is greater in absolute value than the \(t_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that the population mean is different from \(\mu_0\) (i.e., there is a significant mean difference)
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., the mean difference is non-significant).
For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu > \mu_0\), the \(p\)-value is the probability that \(t\) is greater than \(t_\text{observed}\). For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu < \mu_0\), the \(p\)-value is the probability that \(t\) is smaller than \(t_\text{observed}\).
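For illustration only (not exam material), the sketch below runs a one-sample \(t\) test in Python with SciPy; the data and the reference value \(\mu_0 = 5\) are made up.

```python
# One-sample t test against a made-up reference value, plus Cohen's d.
import numpy as np
from scipy import stats

y = np.array([5.5, 6.1, 4.9, 7.0, 6.4, 5.8, 6.9, 5.2])
mu0 = 5.0

t, p = stats.ttest_1samp(y, popmean=mu0)   # two-tailed by default
d = (y.mean() - mu0) / y.std(ddof=1)       # standardized mean difference
print(t, p, d)                             # df = N - 1
```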
7.1.7 Relation to linear models
The one-sample \(t\) test is equivalent to the Wald \(t\) test of the intercept parameter in an intercept-only model, in which the comparison value \(\beta_{H_0} = \mu_0\).
\[y_i = \beta_0 + \epsilon_i\]
In a Wald test of \(\beta_0\), we would have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE_{\beta_0}}\]
Since \(\beta_0\) corresponds to the population mean, \(\hat{\beta}_0\) to its sample estimate, and \(\beta_{H_0} = \mu_0\), we have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE_{\beta_0}} = \frac{\bar{y} - \mu_0}{SE_{\bar{y}}} \]
7.1.8 Assumptions
The assumption of normality is here directly tested using the sample distribution of \(Y\), using the usual tools (density plot, histogram, Shapiro-Wilk test, etc.).
The assumption of homogeneity of variance is not tested because there is no predictor that would make variance heterogeneous.
7.1.9 Typical reporting
We typically report the \(t\) statistic, the degrees of freedom, and the \(p\)-value.
A one-sample \(t\) test was conducted to compare the mean of \(Y\) to the reference value \(\mu_0\). The mean of \(Y\) was significantly higher than \(\mu_0\), \(t(N-1) = t_\text{observed}, p < .001\).
7.2 Paired-samples \(t\) test
The paired-sample \(t\) test is used for situations where we want to compare two repeated measures (e.g., a measure used at two points on the same persons). We will name these repetitions \(1\) and \(2\) throughout.
7.2.1 Example situations in psychology
- A measure of anxiety is taken before and after a treatment, and we want to know if the treatment had an effect on anxiety.
- A test is taken under two different conditions (e.g., on computer vs. on paper), and we want to know if the condition had an effect on test performance.
7.2.2 Paired differences
Let us note \(\Delta Y\) the paired differences variable, which is defined as the differences between the two repetitions \(1\) and \(2\), such as:
\(\Delta Y = Y_2 - Y_1\)
Some software packages compute \(\Delta Y\) as \(Y_1-Y_2\), others as \(Y_2-Y_1\).
| Case | \(Y_1\) | \(Y_2\) | \(\Delta Y = Y_2 - Y_1\) |
|---|---|---|---|
| 1 | 10 | 15 | 5 |
| 2 | 12 | 18 | 6 |
| 3 | 8 | 11 | 3 |
| 4 | 9 | 14 | 5 |
| 5 | 11 | 16 | 5 |
| … | … | … | … |
7.2.3 Mean differences
A common way to describe the difference between the two means \(\bar{y}_1\) and \(\bar{y}_2\) is to compute a (raw) mean difference \(\overline{\Delta y}\):
\[\overline{\Delta y} = \frac{\sum \Delta y}{N}\]
This difference is expressed in the original units of the variable \(Y\).
The difference of two means \(\bar{y}_2 - \bar{y}_1\) is equal to the mean of the differences \(\overline{\Delta y}\). This is not true of many other statistics however (e.g., median, standard deviation). Technically here, we are studying the mean of the differences.
In a lot of cases in the social sciences, these units are arbitrary and/or meaningless. Consequently, it is frequent to express the mean difference in standard deviations. To do this, we compute the Standardized Mean Difference (\(SMD\), also referred to as Cohen’s \(d\)):
\[SMD = d= \frac{\overline{\Delta y}}{s_{\Delta y}}\]
For a better estimation of the standardized mean difference, we would prefer to use the population standard deviation of the differences \(\sigma_{\Delta y}\) . In practice, however, it is never known or assumed. Therefore, its estimator, the sample standard deviation of the differences \(s_{\Delta y}\) is always used instead.
The standardized mean difference is the most common measure of effect size in this context.
| Cohen’s d (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.19 | Negligible effect size |
| 0.20 to 0.39 | Small/weak effect size |
| 0.40 to 0.69 | Medium/Moderate effect size |
| 0.70 and above | Large/Strong effect size |
7.2.4 Null hypothesis
In a paired-samples \(t\) test, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu_1 = \mu_2\]
Alternatively, we can write:
\[H_0 : \mu_2 - \mu_1 = 0\]
7.2.5 Alternate hypothesis
The non-directional hypothesis states that there is a difference between the two means in the population:
\[H_1 : \mu_1 \neq \mu_2\] which can be written as:
\[H_1 : \mu_2 - \mu_1 \neq 0\]
A directional alternate hypothesis specifies a direction for that inequality:
\[H_1' : \mu_1 > \mu_2 \text{ or } H_1' : \mu_1 < \mu_2\]
If using a directional hypothesis, the direction must be specified prior to analysis.
7.2.6 The \(t\) statistic
Under \(H_0\), the mean difference is null in the population, which implies that the following \(t\) statistic…
\[t = \frac{\overline{\Delta y}}{SE_{\Delta Y}} = \frac{\overline{\Delta y}}{s_{\Delta y}/\sqrt{N}}\]
…follows a Student’s \(t\) distribution in the sample, with degrees of freedom \(df = N - 1\).
\[t \sim t(N-1)\]
\(SE_{\Delta Y}\) is the standard error of the mean of the differences.
7.2.7 \(p\) value
The observed \(t\) value is located in the \(t(N-1)\) distribution. The probability that \(t\) is greater in absolute value than the \(t_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that the means are different in the population (i.e., there is a significant mean difference)
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., the mean difference is non-significant).
For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 > \mu_1\) the \(p\)-value is the probability that \(t\) is greater than \(t_\text{observed}\). For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 < \mu_1\) the \(p\)-value is the probability that \(t\) is smaller than \(t_\text{observed}\).
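As an optional illustration (not required for exams), the sketch below runs a paired-samples \(t\) test in Python with SciPy on the first five cases of the example table above.

```python
# Paired-samples t test on the example repeated measures.
import numpy as np
from scipy import stats

y1 = np.array([10, 12, 8, 9, 11], dtype=float)
y2 = np.array([15, 18, 11, 14, 16], dtype=float)
diff = y2 - y1

t, p = stats.ttest_rel(y2, y1)       # equivalent to a one-sample t test of diff against 0
d = diff.mean() / diff.std(ddof=1)   # Cohen's d for the paired differences
print(t, p, d)                       # df = N - 1
```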
7.2.8 Relation to linear models
The paired-sample \(t\) test is equivalent to the Wald \(t\) test of the intercept parameter in an intercept-only model, in which the comparison value \(\beta_{H_0} = 0\), and in which the predicted variable consists of the differences \(\Delta Y\):
\[\Delta y_i = \beta_0 + \epsilon_i\]
In a Wald test of \(\beta_0\), we would have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE(\beta_0)}\]
Since \(\beta_0\) corresponds to the population mean (of \(\Delta Y\)), \(\hat{\beta}_0\) to its sample estimate (i.e., \(\overline{\Delta Y}\)), and \(\beta_{H_0} = 0\), we have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE(\beta_0)} = \frac{\overline{\Delta Y}}{SE_{\Delta Y}} \]
7.2.9 Assumptions
The assumption of normality is here directly tested using the sample distribution of \(\Delta Y\), using the usual tools (density plot, histogram, Shapiro-Wilk test, etc.).
The assumption of homogeneity of variance is not tested because there is no predictor that would make variance heterogeneous.
7.2.10 Typical reporting
We typically report the \(t\) statistic, the degrees of freedom, and the \(p\)-value. We also report the mean difference, the standard deviation of the differences, and the standardized mean difference. For example:
A paired-samples \(t\) test revealed a significant difference between the two conditions, \(t(N-1) = t_\text{observed}, p = p_\text{observed}\), \(d = d_\text{observed}\). The mean in condition 1 (\(M_1\)) was larger than the mean in condition 2 (\(M_2\)).
7.2.11 Common graphical representations
- Box plots
- Means with 95% Confidence Intervals
- Observations connected by case (“Spaghetti plot”)
8 Linear models with numeric predictors
8.1 Simple linear regression
Simple linear regression is used for situations where we have one numeric predictor (IV) \(X\) used to predict one numeric outcome (DV) \(Y\).
The relation between the two variables is assumed to be linear:
\[y_i = \beta_0 + \beta_1x_{i} + \epsilon_i\]
Simple regression is thus a Linear Model.
8.1.1 Example situations in psychology
- We want to study the relation between the score on a personality test (e.g., an extraversion measure) and a numeric psychological outcome (e.g., the score on a depression scale).
- We want to study the relation between the score on a psychological test (e.g., an anxiety measure) and a numeric physiological outcome (e.g., the level of cortisol in saliva).
- We want to study the relation between a numeric demographic predictor (e.g., age) and a numeric psychological outcome (e.g., the score on a memory test).
8.1.2 Parameter estimation
As in all linear models, the intercept and slope are estimated through ordinary least squares, which consists of minimizing the Sum of Squared Errors (\(SSE\)).
8.1.3 Interpretation
The (unstandardized) intercept estimate \(\hat{\beta_0}\) is the predicted value for \(y\) when \(x = 0\) (all in the original units of \(x\) and \(y\)).
The (unstandardized) slope estimate \(\hat{\beta_1}\) is the predicted change in \(y\) when \(x\) increases by one (all in the original units of \(x\) and \(y\)).
The slope estimate can be converted to a standardized slope estimate. The standardized slope estimate is the predicted change in \(y\) (in standard deviations) when \(x\) increases by one standard deviation.
The standardized intercept is by definition \(0\) (and thus trivial and not reported).
8.1.4 Relation to the correlation coefficient
In simple linear regression, the standardized slope estimate is equal to the Pearson correlation coefficient \(r_{XY}\).
In addition, the coefficient of determination \(R^2\) is here equal to the square of the correlation coefficient, \(r_{XY}^2\).
This is not the case in multiple regression.
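For illustration only (not exam material), the sketch below fits a simple regression in Python with SciPy on made-up data and verifies that the standardized slope equals \(r_{XY}\).

```python
# Simple linear regression: the standardized slope equals the Pearson r.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0, 1, size=80)                 # predictor (e.g., extraversion)
y = 2 + 0.4 * x + rng.normal(0, 1, size=80)   # outcome (e.g., neuroticism)

fit = stats.linregress(x, y)
std_slope = fit.slope * x.std(ddof=1) / y.std(ddof=1)   # standardized slope
print(fit.intercept, fit.slope, fit.pvalue)    # unstandardized estimates and p value
print(std_slope, fit.rvalue, fit.rvalue ** 2)  # standardized slope == r; R² == r²
```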
8.1.5 Effect size
The most common measure of effect size in this context is the standardized slope estimate / correlation coefficient.
| Standardized Slope (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.09 | Negligible/Null effect size |
| 0.10 to 0.29 | Small/weak effect size |
| 0.30 to 0.49 | Medium/Moderate effect size |
| 0.50 and above | Large/Strong effect size |
8.1.6 Test of the parameters
The intercept and the slope can be tested through (Wald) \(t\) tests, where:
\[ t = \frac{\beta}{SE(\beta)} \]
The null hypothesis is stated as:
\[ H_0 : \beta_\text{population} = 0 \]
The (non-directional) alternate hypothesis is stated as:
\[ H_1 : \beta \neq 0 \]
Under \(H_0\), \(t_\text{sample}\) follows a Student’s \(t\) distribution (centered around \(0\)), with degrees of freedom \(df=N-2\). The observed \(t\) is compared with that distribution, resulting in a \(p\) value.
The \(p\) value indicates the probability to observe a parameter at least as far from \(0\) as \(\hat{\beta}\), if \(H_0\) is true.
If \(p<.05\), we generally say that the intercept or the slope is significant (or significantly different from \(0\)).
If the slope is significant, we often say that the effect of \(x\) is significant.
8.1.7 \(R^2\) and \(F\) test
In this context, the coefficient of determination is redundant with the correlation coefficient, and is therefore rarely presented.
Similarly, the \(F\) test of model fit yields the same \(p\) value as the (two-tailed) Wald \(t\) test of the slope estimate, and is therefore redundant and rarely presented.
Although rarely reported in this context, they can be (see general section on linear models).
8.1.8 Typical reporting
To report the results of a simple linear regression, we generally report the slope estimate, the \(t\) value and the \(p\) value.
A simple linear regression was conducted to predict neuroticism from extraversion. Extraversion was a significant predictor of neuroticism, \(\beta = \beta_\text{observed}\), \(t(t_\text{df}) = t_\text{observed}\), \(p = p_\text{observed}\). The model accounted for \(R^2\)% of the variance in neuroticism.
8.1.9 Common graphical representation
We can represent the relation between \(x\) and \(y\) in a scatterplot, and add the regression line.
8.2 Multiple linear regression
Multiple linear regression is used for situations where we use several (\(p\)) numeric predictors (IV) \(X_1, X_2, ..., X_p\) used to predict one numeric outcome (DV) \(Y\).
The relation between the outcome and the predictors is assumed to be linear:
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + ... + \beta_px_{pi} + \epsilon_i\]
Multiple regression is thus a Linear Model.
8.2.1 Example situations in psychology
- We want to study the relation between the score on a personality trait (e.g., extraversion) and the score on a cognitive task (e.g., working memory capacity), while controlling for the effect of a third variable (e.g., age).
- We want to study how a set of personality traits (e.g., extraversion, openness, agreeableness) predict a psychological well-being scale score (e.g., depression, anxiety, stress).
- We want to study how a set of psychological well-being scale scores (e.g., depression, anxiety, stress) predict a behavioral outcome measured numerically (e.g., number of cigarettes smoked per day).
8.2.2 Parameter estimation
As in all linear models, the intercept and slopes are estimated through ordinary least squares, which consists of minimizing the Sum of Squared Errors (\(SSE\)).
8.2.3 Interpretation
The (unstandardized) intercept estimate \(\hat{\beta_0}\) is the predicted value for \(y\) when all predictors are equal to \(0\) (all in the original units of \(x\) and \(y\)).
The (unstandardized) slope estimate of a predictor (e.g., \(\hat{\beta_1}\) corresponds to the slope for predictor \(X_1\)) is the predicted change in \(y\) when that predictor increases by one (all in the original units of \(x\) and \(y\)), holding all other predictors constant.
Holding all other predictors constant is a key aspect of multiple regression. It is also called controlling for the other predictors. An implication is that multiple regression is often used to control for confounding variables.
Slope estimates can be converted to standardized slope estimates. The standardized slope estimate is the predicted change in \(y\) (in standard deviations) when the predictor increases by one standard deviation, holding all other predictors constant.
The standardized intercept is by definition \(0\) (and thus trivial and not reported).
8.2.4 Effect size
The most common measure of effect size in this context is the standardized slope estimate.
| Standardized Slope (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.09 | Negligible/Null effect size |
| 0.10 to 0.29 | Small/weak effect size |
| 0.30 to 0.49 | Medium/Moderate effect size |
| 0.50 and above | Large/Strong effect size |
8.2.5 Test of the parameters
The intercept and the slopes can be tested through (Wald) \(t\) tests, where:
\[ t = \frac{\beta}{SE(\beta)} \]
The null hypothesis is stated as:
\[ H_0 : \beta_\text{population} = 0 \]
The (non-directional) alternate hypothesis is stated as:
\[ H_1 : \beta \neq 0 \]
Under \(H_0\), \(t_\text{sample}\) follows a Student’s \(t\) distribution (centered around \(0\)), with degrees of freedom \(df=N-p-1\), where \(p\) is the number of predictors. The observed \(t\) is compared with that distribution, resulting in a \(p\) value.
The \(p\) value indicates the probability to observe a parameter at least as far from \(0\) as \(\hat{\beta}\), if \(H_0\) is true.
If \(p<.05\), we generally say that the intercept or the slope is significant (or significantly different from \(0\)).
If the slope is significant, we often say that the effect of \(x\) is significant, controlling for the other predictors.
8.2.6 \(R^2\) and \(F\) test
Contrary to simple regression, the \(R^2\) of multiple regression is not the square of the correlation coefficient, and is therefore not redundant with it. As a consequence, we typically report the \(R^2\). See Section 5.8 for formula.
Similarly, the \(F\) test of model fit is not redundant with the test of the slope estimate or correlation coefficient. It therefore is also typically reported. See Section 5.9 for formula.
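As an optional illustration (not required for exams), the sketch below fits a two-predictor regression on made-up data, assuming the statsmodels package is available.

```python
# Multiple regression with two made-up predictors (statsmodels assumed available).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = rng.normal(0, 1, size=100)   # e.g., extraversion
x2 = rng.normal(0, 1, size=100)   # e.g., openness
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 1, size=100)   # e.g., agreeableness

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)                   # intercept and unstandardized slopes
print(model.tvalues, model.pvalues)   # Wald t tests, df = N - p - 1
print(model.rsquared, model.fvalue, model.f_pvalue)   # R² and F test of model fit
```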
8.2.7 Typical reporting
The Typical reporting for multiple regression is similar to that of simple regression, with the addition of the \(R^2\) and \(F\) test.
A multiple regression was conducted to predict agreeableness from extraversion, controlling for openness. Extraversion was a significant predictor of agreeableness, \(t(df) = t_\text{observed}, p < .001\), \(\beta = \hat{\beta}\), \(R^2 = \hat{R^2}\), \(F(df) = F_\text{observed}, p < .001\). The model accounted for \(\hat{R^2}\) of the variance in agreeableness.
9 Linear models with group predictors
9.1 Independent samples \(t\) test
The independent samples \(t\) test is used for situations where we want to compare two groups of observations (e.g., two groups of persons), using their means on an outcome variable. We will name these groups \(1\) and \(2\) throughout and the outcome variable \(Y\).
9.1.1 Example situations in psychology
- We want to study how a dichotomous (or dichotomized) demographic variable predicts a psychological well-being variable.
- We want to compare the mean well-being scores of a group that received a treatment with those of a group that did not receive the treatment.
- We want to study how groups placed in different experimental conditions differ on a behavioral outcome measured on a numerical scale (e.g., reaction time, number of errors, etc.).
9.1.2 Mean difference
A common way to describe the difference between the two means \(\bar{y}_1\) and \(\bar{y}_2\) is to compute a (raw) mean difference \(\Delta \bar{y}\):
\[\Delta \bar{y} = \bar{y}_2 - \bar{y}_1\]
This difference is expressed in the original units of the variable \(Y\).
9.1.3 Standardized mean difference
In a lot of cases in the social sciences, the unit of \(Y\) is arbitrary and/or meaningless. Consequently, it is frequent to express the mean difference in standard deviations. To do this, we compute the Standardized Mean Difference (\(SMD\), also referred to as Cohen’s \(d\)):
\[SMD = d = \frac{\Delta \bar{y}}{s_\text{pooled}} = \frac{\bar{y}_2 - \bar{y}_1}{s_\text{pooled}}\]
where \(s_\text{pooled}\) is the pooled standard deviation of the two groups.
The standardized mean difference is the most common measure of effect size in this context.
| Cohen’s d (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.19 | Negligible effect size |
| 0.20 to 0.39 | Small/weak effect size |
| 0.40 to 0.69 | Medium/Moderate effect size |
| 0.70 and above | Large/Strong effect size |
9.1.4 Null hypothesis
In an independent samples \(t\) test, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu_1 = \mu_2\]
Alternatively, we can write:
\[H_0 : \mu_2 - \mu_1 = 0\]
9.1.5 Alternate hypothesis
The non-directional hypothesis states that there is a mean difference in the population:
\[H_1 : \mu_1 \neq \mu_2\] which can be written as:
\[H_1 : \mu_2 - \mu_1 \neq 0\]
A directional alternate hypothesis specifies a direction for that inequality:
\[H_1' : \mu_1 > \mu_2 \text{ or } H_1' : \mu_1 < \mu_2\]
If using a directional hypothesis, the direction must be specified prior to analysis.
9.1.6 The \(t\) statistic
Under \(H_0\), the mean difference is null in the population, which implies that the following \(t\) statistic…
\[t = \frac{\Delta \bar{y} }{SE_{\Delta \bar{y}}}\]
…follows a Student’s \(t\) distribution in the sample, with degrees of freedom \(df = N - 2\).
\[t \sim t(N-2)\]
\(SE_{\Delta \bar{y}}\) is the standard error of the mean difference.
9.1.7 \(p\) value
The observed \(t\) value is located in the \(t(N-2)\) distribution. The probability that \(t\) is greater in absolute value than the \(t_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that the means are different in the population (i.e., there is a significant mean difference).
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., the mean difference is non-significant).
For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 > \mu_1\) the \(p\)-value is the probability that \(t\) is greater than \(t_\text{observed}\). For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 < \mu_1\) the \(p\)-value is the probability that \(t\) is smaller than \(t_\text{observed}\).
9.1.8 Relation to linear models
The independent-samples \(t\) test is equivalent to the Wald \(t\) test of the slope parameter in a linear model, where the predictor is a dummy variable \(D_\text{Group = 2}\) that codes for group membership (e.g., \(D_\text{Group = 2} = 0\) for group 1 and \(D_\text{Group = 2} = 1\) for group 2), and the predicted variable is the outcome variable \(Y\). The model is:
\[y_i = \beta_0 + \beta_1 D_{\text{Group = 2},i}+ \epsilon_i\] In this linear model, \(\beta_1\) is the mean difference between the two groups, and the Wald \(t\) test of \(\beta_1\) is equivalent to the independent-samples \(t\) test.
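For illustration only (not exam material), the sketch below checks this equivalence in Python with SciPy on made-up data: the slope for the dummy predictor equals the mean difference, and the \(p\) value matches the independent-samples \(t\) test.

```python
# Independent-samples t test vs. regression on a 0/1 dummy predictor (made-up data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group1 = rng.normal(10, 2, size=40)
group2 = rng.normal(11, 2, size=40)

t, p = stats.ttest_ind(group1, group2)   # classic test (equal variances assumed)

y = np.concatenate([group1, group2])
dummy = np.concatenate([np.zeros(40), np.ones(40)])   # 0 = group 1, 1 = group 2
fit = stats.linregress(dummy, y)
print(t, p)
print(fit.slope, fit.pvalue)   # slope = mean difference; same p value (t has opposite sign)
```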
9.1.9 Assumptions
The assumption of normality is here directly tested using the distribution of the \(Y\) variable separately in each group (a significant normality test indicates that the distribution is not normal in the population).
The assumption of homogeneity of variance is tested by comparing the variance of the \(Y\) variable in each group, typically using a Levene’s test (a significant result indicates that the variances are different in the population).
9.1.10 Common graphical representations
- Box plots
- Means with 95% Confidence Intervals
9.1.11 Typical reporting
The independent-samples \(t\) test is typically reported as:
An independent-samples \(t\) test revealed that the mean difference between group 1 and group 2 was significant, \(t(N-2) = t_\text{observed}, p = p_\text{observed}\), \(d = d_\text{observed}\). The mean score was higher in group 1 (M = M1, SD = SD1) than in group 2 (M = M2, SD = SD2).
9.2 One-way ANOVA
One-way analysis of variance (ANOVA) is used for situations where we want to compare three or more groups of observations (e.g., three or more groups of persons), using their means on an outcome variable. We will name these groups \(1\), \(2\), \(3\), etc. throughout and the outcome variable \(Y\). We will label \(k\) the number of groups.
9.2.1 Example situations in psychology
- We want to study the effect of a drug on depression. We randomly assign participants to three groups: a placebo group, a low-dose group, and a high-dose group. We measure depression using a questionnaire. We want to know if the mean depression score is different in the three groups.
- We want to study the effect of different teaching methods on learning. We randomly assign participants to three groups: a control group, a video group, and a text group. We measure learning using a test. We want to know if the mean test score is different in the three groups.
9.2.2 Sums of squares
In one-way ANOVA, we use sums of squares to quantify the variability in the data and how it is explained by the group predictor.
The total sum of squares \(SS_\text{total}\) is the sum of the squared deviations of each observation from the overall mean. It is a measure of the total variability in the data.
The between-groups sum of squares \(SS_\text{between}\) is the sum of the squared deviations of each group mean from the overall mean. It is a measure of the variability between groups.
The within-groups sum of squares \(SS_\text{within}\) is the sum of the squared deviations of each observation from its group mean. It is a measure of the variability within groups.
\[SS_\text{total} = SS_\text{between} + SS_\text{within}\]
9.2.3 \(R^2\) and \(\eta^2\)
The \(R^2\) of a one-way ANOVA is the ratio of the between-groups sum of squares to the total sum of squares:
\[R^2 = \frac{SS_\text{between}}{SS_\text{total}}\]
In one-way ANOVA, the \(R^2\) is also called the eta-squared (\(\eta^2\)) and is interpreted as the proportion of variance in the outcome variable that is explained by the group predictor. It is a measure of the effect size of the group predictor.
| \(\eta^2\) | Interpretation |
|---|---|
| < .01 | Negligible/Null effect size |
| .01 to .06 | Small/weak effect size |
| .06 to .13 | Medium/Moderate effect size |
| .14 and above | Large/Strong effect size |
9.2.4 Null hypothesis
In a one-way ANOVA, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu_1 = \mu_2 = ... = \mu_k\]
9.2.5 Alternate hypothesis
The alternate hypothesis states that there is a mean difference between at least two groups in the population. For example if there are three groups, the non-directional hypothesis is:
\[H_1 : \mu_1 \neq \mu_2 \text{ or } \mu_1 \neq \mu_3 \text{ or } \mu_2 \neq \mu_3\]
The alternate hypothesis does not specify which group means are different from each other. It only states that at least two group means are different from each other. This is why the test is called an omnibus test.
9.2.6 The \(F\) statistic
Under \(H_0\), the following \(F\) statistic…
\[F = \frac{MS_\text{between}}{MS_\text{within}} = \frac{SS_\text{between} / df_\text{between}}{SS_\text{within} / df_\text{within}}\]
…follows an \(F\) distribution with \(df_\text{between} = k - 1\) and \(df_\text{within} = N - k\) degrees of freedom.
9.2.7 \(p\) value
The observed \(F\) value is located in the \(F(k - 1, N-k)\) distribution. The probability that \(F\) is greater than the \(F_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that there are significant differences between at least two groups.
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., we do not know if there are mean differences in the population).
9.2.8 Post-hoc tests
When the null hypothesis is rejected, we know that there are significant differences between at least two groups. However, we do not know which groups are different from each other. To test this, we can perform post-hoc tests. The most common post-hoc tests are the Tukey HSD tests.
| Comparison | Mean difference | Lower bound of CI | Upper bound of CI | Adjusted p-value |
|---|---|---|---|---|
| 2-1 | 3.07 | -3.95 | 10.09 | 0.55 |
| 3-1 | 9.65 | 2.63 | 16.67 | 0.00 |
| 3-2 | 6.58 | -0.44 | 13.60 | 0.07 |
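As an optional illustration (not required for exams), the sketch below runs a one-way ANOVA on three made-up groups in Python with SciPy, computes \(\eta^2\) by hand, and obtains Tukey HSD post-hoc tests assuming the statsmodels package is available.

```python
# One-way ANOVA with eta-squared and Tukey HSD post-hoc tests (made-up data).
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(5)
g1 = rng.normal(50, 10, size=30)   # e.g., placebo
g2 = rng.normal(53, 10, size=30)   # e.g., low dose
g3 = rng.normal(60, 10, size=30)   # e.g., high dose

F, p = stats.f_oneway(g1, g2, g3)

y = np.concatenate([g1, g2, g3])
ss_total = np.sum((y - y.mean()) ** 2)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in (g1, g2, g3))
eta2 = (ss_total - ss_within) / ss_total   # SS_between / SS_total
print(F, p, eta2)

groups = np.repeat(["placebo", "low dose", "high dose"], 30)
print(pairwise_tukeyhsd(y, groups))   # mean differences, CIs, adjusted p values
```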
9.2.9 Relation to linear models
The one-way ANOVA can be seen as a special case of the linear model, where the categorical predictor \(X\) is represented by a set of \(k-1\) dummy variables \(D_{X=2}, D_{X=3}, ..., D_{X=k}\). A dummy variable is a numeric variable that only takes values \(0\) or \(1\), and codes for membership in a group (e.g., \(D_{X=2} = 1\) if \(X = 2\) and \(D_{X=2} = 0\) if \(X \neq 2\)).
\[y_i = \beta_0 + \beta_1 D_{X=2,i} + \beta_2 D_{X=3,i} + ... + \beta_{k-1} D_{X=k,i} + \epsilon_i\]
Alternative coding schemes are possible, but in this example, the first group is used as baseline and is therefore not assigned a dummy variable. This is why there are \(k-1\) dummy variables.
In this linear model, \(\beta_0\) is the mean in group 1, \(\beta_1\) is the mean difference between group 1 and group 2, and \(\beta_2\) is the mean difference between group 1 and group 3. The \(F\) test of model fit tests the null hypothesis that all slopes are null, which is here equivalent to there being no mean difference between groups. It is thus a reformulation of the \(F\) test of a one-way ANOVA.
9.2.10 Assumptions
The assumption of normality is here directly tested using the distribution of the \(Y\) variable separately in each group (a significant normality test indicates that the distribution is not normal in the population).
The assumption of homogeneity of variance is tested by comparing the variance of the \(Y\) variable in each group, typically using a Levene’s test (a significant result indicates that the variances are different in the population).
9.2.11 Common graphical representations
- Box plots
- Means with 95% Confidence Intervals
9.2.12 Typical reporting
The one-way ANOVA is typically reported as follows:
A one-way ANOVA revealed a significant effect of Group on Scores, \(F(k-1, N-k) = F_\text{observed}, p = p_\text{observed}, \eta^2 = \eta^2_\text{observed}\). Post-hoc tests revealed that the mean difference between group 1 and group 2 was significant, \(p = p_\text{observed}\), and that the mean difference between group 1 and group 3 was significant, \(p = p_\text{observed}\). The mean difference between group 2 and group 3 was not significant, \(p = p_\text{observed}\).
10 Null hypothesis significance testing
10.1 Overview
Null hypothesis significance testing (NHST) is a statistical procedure that is used to conclude on the presence of effects in the population (i.e., it is an inferential procedure). More broadly, it can also be used to determine if a statistic of interest is significantly different from a reference value (e.g., comparing a mean or skewness to \(0\)).
- We first specify a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_1\)). The null hypothesis is a statement that there is no effect (or in general no difference between the statistic studied and the reference value). The alternate hypothesis is a statement that there is an effect (or more generally, a difference between the statistic studied and the reference value).
- e.g., \(H_0: \mu_1 = \mu_2\) and \(H_1: \mu_1 \neq \mu_2\) (independent samples \(t\) test)
- Note: The null hypothesis and the alternate hypothesis are by definition complementary. If the null hypothesis is rejected, the alternate hypothesis is accepted. Also, note that both the null and alternate hypotheses are always statements about the population, not the sample.
- We then compute a test statistic (e.g., \(t\), \(F\), \(\chi^2\)) from the data. This test statistic, in general, quantifies the difference between the observed data and what the null hypothesis predicts. The choice of the test statistic depends on the type of data and the research question.
- e.g., \(t_\text{observed} = \frac{\bar{X}_2 - \bar{X}_1}{SE_{\bar{X}_2 - \bar{X}_1}}\) (independent samples \(t\) test)
- By definition, the test statistic chosen has a known sampling distribution under the null hypothesis (i.e., if the null hypothesis is true, we know the probability distribution of the test statistic in the sample). The statistic takes the name of the sampling distribution (e.g., \(t\) distribution, \(F\) distribution, \(\chi^2\) distribution). Note that some of these distributions (\(t\), \(F\), \(\chi^2\)) have one or several distributional parameter(s), referred to as degrees of freedom, that depend on the sample size and the research question.
- e.g., \(t_\text{sample} \sim t(df)\) (independent samples \(t\) test)
- We then use the sampling distribution of the test statistic to compute the probability of observing a test statistic at least as extreme as the one that has been observed if the null hypothesis were true. This probability is called the \(p\)-value. In other words, the \(p\)-value is the probability of observing a test statistic as extreme as the one observed, under the null hypothesis.
- e.g., \(p = P(|t| > |t_\text{observed}|)\) (two-tailed independent samples \(t\) test)
- Finally, we compare the \(p\)-value to a pre-specified threshold, called the significance level (\(\alpha\)). The most common significance level is \(0.05\). If the \(p\)-value is less than the significance level, we reject the null hypothesis, and therefore confirm the alternate hypothesis. We therefore conclude that there is an effect (or more generally a difference between the statistic studied and the reference value). We often say that the effect (or the difference observed) is statistically significant. If the \(p\)-value is greater than the significance level, we fail to reject the null hypothesis (i.e., we retain it, but do not confirm it), and therefore cannot conclude as to whether there is an effect (or more generally a difference between the statistic studied and the reference value) in the population. We often say that the effect (or the difference) is not statistically significant.
- e.g., For an independent samples \(t\) test, if \(p < \alpha\), we reject \(H_0\) and conclude that the mean difference between group 1 and group 2 is significant. If \(p > \alpha\), we fail to reject \(H_0\) and conclude that the mean difference between group 1 and group 2 is not significant.
10.2 Important notes and implications
We never accept the null hypothesis. We either reject it or fail to reject it. This is because we can never prove that the null hypothesis is true. We can only provide evidence against it.
The \(p\)-value is not the probability that the null hypothesis is true. It is the probability of observing a test statistic as extreme as the one observed, under the null hypothesis. It is used as a measure of the strength of the evidence against the null hypothesis.
All else being equal, a larger sample size will lead to a smaller \(p\)-value. However, a large sample size does not guarantee a small \(p\)-value. For example, if the effect size is small, the \(p\)-value may still be large.
All else being equal, a larger effect size will lead to a smaller \(p\)-value. However, a large effect size does not guarantee a small \(p\)-value. For example, if the sample size is small, the \(p\)-value may still be large.
10.3 Error types and statistical power
In hypothesis testing, we can make two types of errors:
- Type I error: Rejecting the null hypothesis when it is true (false detection/false positive). The typical significance level (\(\alpha = .05\)) indicates that we are willing to accept a 5% chance of making a Type I error.
- Type II error: Failing to reject the null hypothesis when it is false (missed detection/false negative). The statistical power of a test is the probability of correctly rejecting the null hypothesis when it is false (\(1-\) the probability of a Type II error).